Skip to content

fix(test): resolve thread leak failures in CI#1533

Open
SUMMERxYANG wants to merge 7 commits intodevfrom
summer/fix/thread-leak-tests
Open

fix(test): resolve thread leak failures in CI#1533
SUMMERxYANG wants to merge 7 commits intodevfrom
summer/fix/thread-leak-tests

Conversation

@SUMMERxYANG
Copy link
Contributor

@SUMMERxYANG SUMMERxYANG commented Mar 12, 2026

Problem

test_process_crash_triggers_stop can flakily fail CI with Non-closed threads created during this test for threads run_forever, _lcm_loop, _watch_process.

The test waits for the watchdog to detect a process crash and call stop(), but the watchdog calls stop() from its own thread and can't join itself — so its thread (and sometimes the LCM/event-loop threads) may still be alive when the monitor_threads fixture checks.

Solution

Convert the test to use a pytest fixture that calls mod.stop() in teardown. This joins all threads from the test's main thread. stop() is idempotent so it's safe even after the watchdog already called it.

Breaking Changes

None

How to Test

uv run pytest dimos/core/test_native_module.py::test_process_crash_triggers_stop -v --noconftest --timeout=30

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR fixes two pre-existing thread-leak CI failures by (1) adding an explicit mod.stop() at the end of test_process_crash_triggers_stop so the LCM-loop and event-loop threads are properly joined from the test thread (the watchdog cannot join itself, but the base Module._close_module() is idempotent via _module_closed_lock, so the second call is safe), and (2) adding a regex filter in the monitor_threads conftest fixture to ignore daemon threads whose names match the bare Thread-\d+ pattern, which targets unnamed torch/HuggingFace threads that have no public cleanup API.

Key changes and observations:

  • test_native_module.py: mod.stop() is correctly idempotent — NativeModule.stop() checks _stopping and _module_closed_lock prevents _close_module from executing twice. The fix correctly targets the run_forever / _lcm_loop threads by ensuring super().stop() is called from a thread that isn't the watchdog.
  • conftest.py: The Thread-\d+ daemon filter depends on the Python 3.12+ thread-naming convention (where threading.Thread(target=fn) produces names like "Thread-N (fn_name)"). On Python < 3.12, project-owned daemon threads without an explicit name= would share the same Thread-N format and would be silently filtered, masking real leaks. If CI is exclusively Python 3.12+, this is harmless, but the assumption is currently undocumented.

Confidence Score: 4/5

  • Safe to merge; both fixes are correct and targeted, with one minor assumption about Python 3.12+ thread-naming that could hide leaks on older interpreters.
  • The mod.stop() addition in the test is well-understood and safe — _module_closed_lock ensures idempotency, and the @rpc decorator is a pure marker (no async dispatch). The conftest filter is narrow (only daemon threads, only exact Thread-\d+ format), but implicitly relies on Python 3.12+ naming semantics to distinguish third-party threads from project threads. The risk is a silent regression in thread-leak detection if the project ever adds unnamed daemon threads or runs on Python < 3.12, not a breakage in production behaviour.
  • dimos/conftest.py — the new daemon-thread filter should document its Python 3.12+ assumption or guard against older interpreters.

Important Files Changed

Filename Overview
dimos/conftest.py Adds a filter in monitor_threads to skip daemon threads with generic Thread-\d+ names; relies on Python 3.12+ thread-naming behaviour to avoid masking own-code leaks.
dimos/core/test_native_module.py Adds explicit mod.stop() after the crash-detection assertion so the LCM-loop and event-loop threads are properly joined from the test thread, fixing thread-leak failures in CI.

Sequence Diagram

sequenceDiagram
    participant Test as Test Thread
    participant Watchdog as Watchdog Thread (_watch_process)
    participant Process as Native Subprocess
    participant Module as NativeModule.stop()
    participant Base as Module._close_module()

    Test->>Process: mod.start() → subprocess spawned
    Test->>Watchdog: watchdog thread started (daemon=True)
    Process-->>Watchdog: process exits (die_after=0.2)
    Watchdog->>Module: self.stop() [_stopping=False → crash path]
    Module->>Module: _stopping = True
    Module->>Module: self._watchdog is current_thread() → skip join
    Module->>Module: self._watchdog = None, self._process = None
    Module->>Base: super().stop() → _close_module()
    Base->>Base: _module_closed_lock acquired, joins run_forever / _lcm_loop threads
    Watchdog-->>Watchdog: _watch_process() returns → thread terminates

    Test->>Test: poll: mod._process is None → break
    Test->>Module: mod.stop() [idempotent]
    Module->>Module: _stopping already True, _process already None, _watchdog already None
    Module->>Base: super().stop() → _close_module()
    Base->>Base: _module_closed already True → early return (no-op)
    Note over Test,Base: monitor_threads fixture sees no leaked threads ✓
Loading

Last reviewed commit: 021be1f

Comment on lines 164 to 172
# Filter out third-party daemon threads with generic names (e.g. "Thread-109").
# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
new_threads = [
t for t in new_threads if not (t.daemon and re.fullmatch(r"Thread-\d+", t.name))
]

# Filter out threads we've already seen (from previous tests)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter silently suppresses leaks on Python < 3.12

The comment correctly explains the Python 3.12+ behaviour where threading.Thread(target=fn) embeds fn.__name__ in the thread name (e.g. "Thread-166 (run_forever)"). On Python 3.11 and earlier, auto-named threads are just "Thread-N" regardless of whether a target was passed, so every daemon thread started by project code without an explicit name= argument would also be silently skipped by this filter.

If the CI exclusively runs Python 3.12+ this is fine as written, but it's worth documenting that assumption or adding a Python-version guard so the filter doesn't inadvertently hide regressions on older interpreters:

import sys

_FILTER_GENERIC_DAEMON_THREADS = sys.version_info >= (3, 12)

# Filter out third-party daemon threads with generic names (e.g. "Thread-109").
# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
# NOTE: On Python < 3.12 all auto-named threads share this format, so the
# filter is intentionally disabled there to preserve leak detection.
if _FILTER_GENERIC_DAEMON_THREADS:
    new_threads = [
        t
        for t in new_threads
        if not (t.daemon and re.fullmatch(r"Thread-\d+", t.name))
    ]

@spomichter
Copy link
Contributor

im cnfused here tho becuase it only fails CI SOME of the time

@spomichter
Copy link
Contributor

@paul-nechifor would know why this is flakey

# On Python 3.12+ our own threads include the target function name in parens
# (e.g. "Thread-166 (run_forever)"), so this only matches unnamed threads
# from libraries like torch/HuggingFace that have no cleanup API.
new_threads = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sweeping the problem under the rug. If there's a specific thread leak we can't fix because it's from a third party library, we should ignore that issue alone, not all generic thread names, because we use generic thread names too.

@paul-nechifor paul-nechifor changed the base branch from main to dev March 12, 2026 20:54
@paul-nechifor
Copy link
Contributor

paul-nechifor commented Mar 12, 2026

@paul-nechifor would know why this is flakey

dimos/core/test_native_module.py is a good fix (although a fixture would be better), but the other change is not. Do you know where those tests are failing in CI? I haven't seen it.

SUMMERxYANG and others added 6 commits March 12, 2026 16:05
- Add mod.stop() to test_process_crash_triggers_stop so watchdog, LCM,
  and event-loop threads are properly joined from the test thread
- Filter third-party daemon threads with generic names (Thread-\d+) in
  conftest monitor_threads to ignore torch/HF background threads that
  have no cleanup API
Convert test_process_crash_triggers_stop to use a fixture that calls
mod.stop() in teardown. The watchdog thread calls self.stop() but can't
join itself, so an explicit stop() from the test thread is needed to
properly clean up all threads.

Drop the broad conftest regex filter for generic daemon thread names
per review feedback.
mod.stop() is a no-op when the watchdog already called it, so
capture thread IDs before the test and join new ones in teardown.
@SUMMERxYANG SUMMERxYANG force-pushed the summer/fix/thread-leak-tests branch from 3f2eb91 to 3197ad3 Compare March 12, 2026 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants